intelligibility prediction
Leveraging Multiple Speech Enhancers for Non-Intrusive Intelligibility Prediction for Hearing-Impaired Listeners
Cao, Boxuan, Li, Linkai, Yu, Hanlin, Mo, Changgeng, Zhou, Haoshuai, Wang, Shan Xiang
Speech intelligibility evaluation for hearing-impaired (HI) listeners is essential for assessing hearing aid performance, traditionally relying on listening tests or intrusive methods like HASPI. However, these methods require clean reference signals, which are often unavailable in real-world conditions, creating a gap between lab-based and real-world assessments. To address this, we propose a non-intrusive intelligibility prediction framework that leverages speech enhancers to provide a parallel enhanced-signal pathway, enabling robust predictions without reference signals. We evaluate three state-of-the-art enhancers and demonstrate that prediction performance depends on the choice of enhancer, with ensembles of strong enhancers yielding the best results. To improve cross-dataset generalization, we introduce a 2-clips augmentation strategy that enhances listener-specific variability, boosting robustness on unseen datasets. Our approach consistently outperforms the non-intrusive baseline, CPC2 Champion across multiple datasets, highlighting the potential of enhancer-guided non-intrusive intelligibility prediction for real-world applications.
- North America > United States (0.04)
- North America > Canada > British Columbia (0.04)
- Asia > China (0.04)
No Audiogram: Leveraging Existing Scores for Personalized Speech Intelligibility Prediction
Zhou, Haoshuai, Mo, Changgeng, Cao, Boxuan, Li, Linkai, Wang, Shan Xiang
Personalized speech intelligibility prediction is challenging. Previous approaches have mainly relied on audiograms, which are inherently limited in accuracy as they only capture a listener's hearing threshold for pure tones. Rather than incorporating additional listener features, we propose a novel approach that leverages an individual's existing intelligibility data to predict their performance on new audio. We introduce the Support Sample-Based Intelligibility Prediction Network (SSIPNet), a deep learning model that leverages speech foundation models to build a high-dimensional representation of a listener's speech recognition ability from multiple support (audio, score) pairs, enabling accurate predictions for unseen audio. Results on the Clarity Prediction Challenge dataset show that, even with a small number of support (audio, score) pairs, our method outperforms audiogram-based predictions. Our work presents a new paradigm for personalized speech intelligibility prediction.
- North America > United States > Florida > Hillsborough County > University (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
- (2 more...)
Unveiling the Best Practices for Applying Speech Foundation Models to Speech Intelligibility Prediction for Hearing-Impaired People
Zhou, Haoshuai, Cao, Boxuan, Mo, Changgeng, Li, Linkai, Wang, Shan Xiang
Speech foundation models (SFMs) have demonstrated strong performance across a variety of downstream tasks, including speech intelligibility prediction for hearing-impaired people (SIP-HI). However, optimizing SFMs for SIP-HI has been insufficiently explored. In this paper, we conduct a comprehensive study to identify key design factors affecting SIP-HI performance with 5 SFMs, focusing on encoder layer selection, prediction head architecture, and ensemble configurations. Our findings show that, contrary to traditional use-all-layers methods, selecting a single encoder layer yields better results. Additionally, temporal modeling is crucial for effective prediction heads. We also demonstrate that ensembling multiple SFMs improves performance, with stronger individual models providing greater benefit. Finally, we explore the relationship between key SFM attributes and their impact on SIP-HI performance. Our study offers practical insights into effectively adapting SFMs for speech intelligibility prediction for hearing-impaired populations.
- North America > United States > Florida > Hillsborough County > University (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- Asia > China (0.04)
Transfer Learning from Whisper for Microscopic Intelligibility Prediction
Best, Paul, Cuervo, Santiago, Marxer, Ricard
Macroscopic intelligibility models predict the expected human word-error-rate for a given speech-in-noise stimulus. In contrast, microscopic intelligibility models aim to make fine-grained predictions about listeners' perception, e.g. predicting phonetic or lexical responses. State-of-the-art macroscopic models use transfer learning from large scale deep learning models for speech processing, whereas such methods have rarely been used for microscopic modeling. In this paper, we study the use of transfer learning from Whisper, a state-of-the-art deep learning model for automatic speech recognition, for microscopic intelligibility prediction at the level of lexical responses. Our method outperforms the considered baselines, even in a zero-shot setup, and yields a relative improvement of up to 66\% when fine-tuned to predict listeners' responses. Our results showcase the promise of large scale deep learning based methods for microscopic intelligibility prediction.
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- Europe > Germany > Berlin (0.04)
Speech foundation models on intelligibility prediction for hearing-impaired listeners
Cuervo, Santiago, Marxer, Ricard
Speech foundation models (SFMs) have been benchmarked on many speech processing tasks, often achieving state-of-the-art performance with minimal adaptation. However, the SFM paradigm has been significantly less explored for applications of interest to the speech perception community. In this paper we present a systematic evaluation of 10 SFMs on one such application: Speech intelligibility prediction. We focus on the non-intrusive setup of the Clarity Prediction Challenge 2 (CPC2), where the task is to predict the percentage of words correctly perceived by hearing-impaired listeners from speech-in-noise recordings. We propose a simple method that learns a lightweight specialized prediction head on top of frozen SFMs to approach the problem. Our results reveal statistically significant differences in performance across SFMs. Our method resulted in the winning submission in the CPC2, demonstrating its promise for speech perception applications.
- Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.05)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
- Europe > France > Auvergne-Rhône-Alpes > Lyon > Lyon (0.04)